Purpose: Revenue vs AdSpend effect

Author: Gerardo Gandara

Contact:

Client:

Code created: 2023-10-11

Last updated: 2023-10-11

Source: https://github.com/ggandara13/myrepository

Comment: https://getrecast.com/modern-media-mix-modeling/.

Libraries

These are the libraries to load

pkg <- c('pivottabler', 'tidyverse','tibbletime','anomalize','timetk')
#install.packages(pkg)

library(pivottabler)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(tibbletime)
library(anomalize)
library(timetk)
library(forecast)

Read the revenue and AdSpend

We check the date type of the ‘date’ column and change the type to DATE

The CSV data is avaialble in the github repository


setwd("/Users/gerardogandara/Documents/recast/")

## Read file as CSV
df_spend <- read.csv('https://raw.githubusercontent.com/ggandara13/myrepository/master/acme_spend.csv')
df_revenue <- read.csv('https://raw.githubusercontent.com/ggandara13/myrepository/master/acme_revenue.csv')

# Convert char format to DATE format
df_revenue$date <- as.Date(df_revenue$date, format =  "%m/%d/%y")
df_spend$date <- as.Date(df_spend$date, format =  "%m/%d/%y")

# Check the format and content of each dataframe
str(df_revenue)
'data.frame':   1035 obs. of  5 variables:
 $ X              : int  1 2 3 4 5 6 7 8 9 10 ...
 $ date           : Date, format: "2020-01-01" "2020-01-02" "2020-01-03" ...
 $ revenue_dtc    : num  16000 19306 22052 23177 26169 ...
 $ revenue_amazon : num  50000 60332 68914 72428 81778 ...
 $ revenue_walmart: num  34000 41026 46861 49251 55609 ...
str(df_spend)
'data.frame':   10350 obs. of  4 variables:
 $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ date   : Date, format: "2020-01-01" "2020-01-01" "2020-01-01" ...
 $ channel: chr  "facebook_prospecting" "facebook_retargeting" "google_branded_search" "google_nonbranded_search" ...
 $ spend  : num  4154 2679 474 3817 7421 ...

Check there are not NULL values


paste("Total of NULL in Spend:", sum(is.na(df_spend)))
[1] "Total of NULL in Spend: 0"
paste("\nTotal of NULL in Revenue:", sum(is.na(df_revenue)))
[1] "\nTotal of NULL in Revenue: 0"
paste("\n\nSummary\n")
[1] "\n\nSummary\n"
# you can check with summary as well and see the columns
summary(df_spend)
       X              date              channel              spend            Year     
 Min.   :    1   Min.   :2020-01-01   Length:10350       Min.   :    0   Min.   :2020  
 1st Qu.: 2588   1st Qu.:2020-09-15   Class :character   1st Qu.: 1278   1st Qu.:2020  
 Median : 5176   Median :2021-06-01   Mode  :character   Median : 5268   Median :2021  
 Mean   : 5176   Mean   :2021-06-01                      Mean   : 7740   Mean   :2021  
 3rd Qu.: 7763   3rd Qu.:2022-02-15                      3rd Qu.:11649   3rd Qu.:2022  
 Max.   :10350   Max.   :2022-10-31                      Max.   :36401   Max.   :2022  
summary(df_revenue)
       X               date             revenue_dtc     revenue_amazon   revenue_walmart 
 Min.   :   1.0   Min.   :2020-01-01   Min.   :     0   Min.   :     0   Min.   :     0  
 1st Qu.: 259.5   1st Qu.:2020-09-15   1st Qu.: 41643   1st Qu.:130216   1st Qu.: 88547  
 Median : 518.0   Median :2021-06-01   Median : 72051   Median :225161   Median :153109  
 Mean   : 518.0   Mean   :2021-06-01   Mean   : 72741   Mean   :227062   Mean   :154488  
 3rd Qu.: 776.5   3rd Qu.:2022-02-14   3rd Qu.: 97340   3rd Qu.:303352   3rd Qu.:206279  
 Max.   :1035.0   Max.   :2022-10-31   Max.   :193005   Max.   :415952   Max.   :282848  
sum(is.na(df_spend))/nrow(df_spend)*100
[1] 0
sum(is.na(df_revenue))/nrow(df_revenue)*100
[1] 0

Since we have 1,035 rows, we can remove because that represents less than 1% or just replace with zero

Now, let’s replace with zero since it is a time series


df_revenue[is.na(df_revenue)] <- 0
df_spend[is.na(df_spend)] <- 0


cat("Total of NULL in Spend:", sum(is.na(df_spend)))
Total of NULL in Spend: 0
cat("\nTotal of NULL in Revenue:", sum(is.na(df_revenue)))

Total of NULL in Revenue: 0

1- Which channel had the largest increase in spend in 2022 compared to the same date range in 2021?

Using lubridate to filter the year of the date, also use the pivottabler library to group/aggregate it

Answer: online_video

# print the first value of the dataframe
paste("The channel with highest LIFT is:", row.names(df_summary_final)[1] )
[1] "The channel with highest LIFT is: online_video"

2- In terms of total revenue, are there any anomalous days?

Group by day to see the trend and plot


df_revenue_total <- df_revenue %>%
  mutate(total_revenue = ( df_revenue$revenue_dtc + df_revenue$revenue_amazon + df_revenue$revenue_walmart  ))

# Group by sum using R Base aggregate()
agg_df <- aggregate(df_revenue_total$total_revenue, by=list(df_revenue_total$date), FUN=sum)
colnames(agg_df) <- c('date','revenue') 


#Plot sales over 36 months
ggplot(agg_df, aes(x=date, y = revenue)) + geom_line()


paste("Visually we can see some outliers")
[1] "Visually we can see some outliers"

Now, we can use the anomalties library to identify the values

Use time_decompose() to decompose a time series prior to performing anomaly detection with anomalize(). Typically, anomalize() is performed on the “remainder” of the time series decomposition.

The return has three columns: “remainder_l1” (lower limit for anomalies), “remainder_l2” (upper limit for anomalies), and “anomaly” (Yes/No).

We can print the detail of the anomalies, for example the last 4

tail(df_anomalized %>%
  filter(anomaly == 'Yes'),4)
NA

You can also use the timetk package for dynamic plot

agg_df %>% timetk::plot_anomaly_diagnostics(date,revenue, .facet_ncol = 2)
frequency = 7 observations per 1 week
trend = 92 observations per 3 months

3- In which month of the year does Acme tend to make the most revenue?

We can compute month column before we aggregate and sort to get the popular month

Most profitale month in a year is the month number: 3


# Calculate the month column
monthfunction<-function(dataframe, datecolumn) {
    month(dataframe[,datecolumn])
}
agg_df$Month <- monthfunction(agg_df, "date")

# group by month

by_month <- aggregate(agg_df$revenue, by = list(agg_df$Month), FUN = sum)
colnames(by_month) <- c('month','revenue') 



by_month <- by_month[order(by_month$revenue, decreasing = TRUE), ]

# print the first value of the dataframe
print(by_month)
paste("Most profitale month in a year is the month number:", row.names(by_month)[1] )
[1] "Most profitale month in a year is the month number: 3"

4- Does Acme’s marketing spend tend to follow a similar pattern to revenue?

We need to merge both dataframes to have the TS of Revenue using AdSped as regressor Convert the dataframe to TS then plot, then plot the results

Visually we can confirm they have a similar pattern.


#select the date and total_revenue of Acme
df_only_revenue <- select(df_revenue_total, date, total_revenue)

#aggregate spend of all media
df_spend_total <- aggregate(df_spend$spend, by=list(df_spend$date), FUN=sum)
colnames(df_spend_total) <- c('date','spend') 

df_rev_spend <- merge(df_only_revenue, df_spend_total, by = "date")


#Convert dataframe to time series object using the ts() function
acme_ts <- ts(data = df_rev_spend[,c(2,3)])


# Time plot of both variables
autoplot(acme_ts, facets = TRUE)

---
title: "Assignment-RECAST Modern Marketing Mix Modeling "
output:
  html_document:
    df_print: paged
---

###########################################################################
#                                                                         #
# Purpose:       Revenue vs AdSpend effect                                #
#                                                                         #
# Author:        Gerardo Gandara                                          #
# Contact:       gerardo.gandara@gmail.com                                #
#                                                                         #
# Client:        andy@getrecast.com                                       #
#                                                                         #
# Code created:  2023-10-11                                               #
# Last updated:  2023-10-11                                               #
# Source:        https://github.com/ggandara13/myrepository               #
#                                                                         #
# Comment:       https://getrecast.com/modern-media-mix-modeling/.        #
#                                                                         #
###########################################################################


# Libraries
These are the libraries to load


```{r}
pkg <- c('pivottabler', 'tidyverse','tibbletime','anomalize','timetk')
#install.packages(pkg)

library(pivottabler)
library(lubridate)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(tibbletime)
library(anomalize)
library(timetk)
library(forecast)

```



Read the revenue and AdSpend

We check the date type of the 'date' column and change the type to DATE

The CSV data is avaialble in the github repository


```{r}

setwd("/Users/gerardogandara/Documents/recast/")

## Read file as CSV
df_spend <- read.csv('https://raw.githubusercontent.com/ggandara13/myrepository/master/acme_spend.csv')
df_revenue <- read.csv('https://raw.githubusercontent.com/ggandara13/myrepository/master/acme_revenue.csv')

# Convert char format to DATE format
df_revenue$date <- as.Date(df_revenue$date, format =  "%m/%d/%y")
df_spend$date <- as.Date(df_spend$date, format =  "%m/%d/%y")

# Check the format and content of each dataframe
str(df_revenue)
str(df_spend)

```


Check there are not NULL values 

```{r}

paste("Total of NULL in Spend:", sum(is.na(df_spend)))

paste("\nTotal of NULL in Revenue:", sum(is.na(df_revenue)))

paste("\n\nSummary\n")
# you can check with summary as well and see the columns
summary(df_spend)
summary(df_revenue)

sum(is.na(df_spend))/nrow(df_spend)*100
sum(is.na(df_revenue))/nrow(df_revenue)*100


```
Since we have 1,035 rows, we can remove because that represents less than 1% or just replace with zero

Now, let's replace with zero since it is a time series
```{r}

df_revenue[is.na(df_revenue)] <- 0
df_spend[is.na(df_spend)] <- 0


cat("Total of NULL in Spend:", sum(is.na(df_spend)))
cat("\nTotal of NULL in Revenue:", sum(is.na(df_revenue)))


```




## 1- Which channel had the largest increase in spend in 2022 compared to the same date range in 2021?

Using lubridate to filter the year of the date, also use the pivottabler library to group/aggregate it

Answer: online_video


```{r}
yearfunction<-function(dataframe, datecolumn) {
    year(dataframe[,datecolumn])
}
df_spend$Year <- yearfunction(df_spend, "date")


df <- df_spend %>% group_by(channel) %>% 
  filter(lubridate::year(date) %in% c(2021, 2022) )
df <- as.data.frame(df)


# initialice the PIVOT
pt <- PivotTable$new()
# add the df with 2 years
pt$addData(df)
pt$addColumnDataGroups("Year")
pt$addRowDataGroups("channel")
pt$defineCalculation(calculationName="TotalSpend", summariseExpression="sum(spend)")
pt$evaluatePivot()

#covnert to dataframe
df1 <- pt$asDataFrame()
df_summary_final <- df1 %>%
  mutate(index_lift = ( (df1[,2]/ df1[,1]) - 1 ) *100  )

df_summary_final<- df_summary_final[order(df_summary_final$index_lift, decreasing = TRUE), ]


# print the first value of the dataframe
paste("The channel with highest LIFT is:", row.names(df_summary_final)[1] )


```




## 2- In terms of total revenue, are there any anomalous days?

Group by day to see the trend and plot


```{r}

df_revenue_total <- df_revenue %>%
  mutate(total_revenue = ( df_revenue$revenue_dtc + df_revenue$revenue_amazon + df_revenue$revenue_walmart  ))

# Group by sum using R Base aggregate()
agg_df <- aggregate(df_revenue_total$total_revenue, by=list(df_revenue_total$date), FUN=sum)
colnames(agg_df) <- c('date','revenue') 


#Plot sales over 36 months
ggplot(agg_df, aes(x=date, y = revenue)) + geom_line()

paste("Visually we can see some outliers")
```

Now, we can use the anomalties library to identify the values

Use time_decompose() to decompose a time series prior to performing anomaly detection with anomalize(). Typically, anomalize() is performed on the "remainder" of the time series decomposition.

The return has three columns: "remainder_l1" (lower limit for anomalies), "remainder_l2" (upper limit for anomalies), and "anomaly" (Yes/No).



```{r}
# Convert df to a tibble
df <- as_tibble(agg_df)
class(df)

df_anomalized <- df %>%
    time_decompose(revenue, method = "stl", merge = TRUE) %>%
    anomalize(remainder, method = "iqr", alpha = 0.05, max_anoms = 0.2) %>%
    time_recompose()

#We can then visualize the anomalies using the plot_anomalies() function. 
df_anomalized %>% plot_anomalies(ncol = 1, alpha_dots = 0.75)

```

We can print the detail of the anomalies, for example the last 4


```{r}
tail(df_anomalized %>%
  filter(anomaly == 'Yes'),4)
```

You can also use the timetk package for dynamic plot

```{r}
agg_df %>% timetk::plot_anomaly_diagnostics(date,revenue, .facet_ncol = 2)

```


## 3- In which month of the year does Acme tend to make the most revenue?

We can compute month column before we aggregate and sort to get the popular month

Most profitale month in a year is the month number: 3

```{r}

# Calculate the month column
monthfunction<-function(dataframe, datecolumn) {
    month(dataframe[,datecolumn])
}
agg_df$Month <- monthfunction(agg_df, "date")

# group by month

by_month <- aggregate(agg_df$revenue, by = list(agg_df$Month), FUN = sum)
colnames(by_month) <- c('month','revenue') 



by_month <- by_month[order(by_month$revenue, decreasing = TRUE), ]

# print the first value of the dataframe
print(by_month)
paste("Most profitale month in a year is the month number:", row.names(by_month)[1] )

```


## 4- Does Acme's marketing spend tend to follow a similar pattern to revenue?

We need to merge both dataframes to have the TS of Revenue using AdSped as regressor
Convert the dataframe to TS then plot, then plot the results

Visually we can confirm they have a similar pattern.

```{r}

#select the date and total_revenue of Acme
df_only_revenue <- select(df_revenue_total, date, total_revenue)

#aggregate spend of all media
df_spend_total <- aggregate(df_spend$spend, by=list(df_spend$date), FUN=sum)
colnames(df_spend_total) <- c('date','spend') 

df_rev_spend <- merge(df_only_revenue, df_spend_total, by = "date")


#Convert dataframe to time series object using the ts() function
acme_ts <- ts(data = df_rev_spend[,c(2,3)])


# Time plot of both variables
autoplot(acme_ts, facets = TRUE)

```






## 5- Based on what you see here, could you make the case that marketing spend is causally related to revenue? Why or why not?

Visually we can see the effect has high correlation, and when we calculate the R2 is 0.91 with p=value basically zero. So, it is statiscally correlated.


```{r}

library("ggpubr")

df_rev_spend

ggscatter(df_rev_spend, x = "spend", y = "total_revenue", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Total AdSpend", ylab = "Total Revenue - 3 sources ")

```
Is the relation linear? Yes, form the plot above, the relationship is linear.

now we jsut need to verify that the data from each of the 2 variables (revenue, spend) follow a normal distribution.

```{r}

df_rev_spend


# Shapiro-Wilk normality test for revenue
shapiro.test(df_rev_spend$total_revenue) # => p-value < 2.2e-16

# Shapiro-Wilk normality test for spend
shapiro.test(df_rev_spend$spend) # => p-value < 2.2e-16


# revenue 
ggqqplot(df_rev_spend$total_revenue, ylab = "revenue")
# spend
ggqqplot(df_rev_spend$spend, ylab = "spend")
```

Since the p-value is practically zero, we can reject the null hypothesis that it is normal distributed both variables

so we can use another test to check asociation
Kendall rank correlation test - Kendall’s tau 

This is used to estimate a rank-based measure of association. This test may be used if the data do not necessarily come from a bivariate normal distribution.

The correlation coefficient between revenue and spend are 0.7209243 and the p-value < 2.2e-16.
Exist correlation

but we know it is not causation

```{r}
res2 <- cor.test(df_rev_spend$total_revenue, df_rev_spend$spend,  method="kendall")
res2
```


We know that they are highly correlated. We can use time series to see the magnitude of the effect of marketing using it as regressor

Visually in the previous plot of revenue, we can see it is stationary, and there is seasonality. We can use auto.arima to calculate the regressor factor

This is a very basic analysis, but it helps to understand in overall how the marketing efforts are creating an effect in the revenue.

1.489071 is the regressor factor

#### Interpretation of sales increase: For every $1,000 (the unit of ad spend is in thousands USD) increase in advertising spend, the unit sales increase by 1.4891 - It is a positive effect, so ACME probably is using RECAST and the marketing campaign is performaing well



```{r}
# we need to convert the daily data to weekly
weekno <- as.numeric(df_rev_spend$date - df_rev_spend$date[1]) %/% 7
Week <- df_rev_spend$date[1] + 7 * weekno
df_by_week <- aggregate(cbind(total_revenue,spend)  ~ Week, df_rev_spend, sum)

#to check it looks good
df_by_week

#Convert dataframe to time series object using the ts() function
acme_ts_week <- ts(data = df_by_week[,c(2,3)])
# Time plot of both variables
autoplot(acme_ts_week, facets = TRUE)

# Fit ARIMA model
fit <- auto.arima(acme_ts_week[, "total_revenue"], xreg = acme_ts_week[, "spend"], 
                  stationary = TRUE, seasonal=TRUE, 
                  allowdrift = FALSE, allowmean = FALSE )#,lambda = Lambda)

fit

#The auto.arima function has fit a linear regression model to the advertising variable and an an ARIMA model to the time series (date) variable. ARIMA(2,0,1)

#What is the increase in sales for each unit increase in advertising?
sales_increase <- coefficients(fit)[4]
sales_increase




#Check Residuals to verify that the residuals are white noise and normal distributed - Indeed they are so the model is acceptable
checkresiduals(fit)

```



